[Feature](Compaction) Add compaction profile manager with HTTP API for historical tracking #61374

Open
Yukang-Lian wants to merge 3 commits into apache:master from
Yukang-Lian:feature/compaction-profile

Conversation

@Yukang-Lian
Collaborator

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Currently, compaction execution metrics (input/output data sizes, row counts, merge latency, etc.) are lost once the compaction object is destructed. Operators cannot inspect historical compaction performance through the web interface; only the current running status is available.

This PR introduces a compaction profile history feature that lets operators query detailed execution metrics of recent compactions via GET /api/compaction/profile, enabling performance diagnosis and bottleneck identification without parsing BE logs.

Example GET /api/compaction/profile?tablet_id=12345&top_n=1:

{
  "status": "Success",
  "compaction_profiles": [
    {
      "compaction_id": 487,
      "compaction_type": "cumulative",
      "tablet_id": 12345,
      "start_time": "2025-07-15 14:02:31",
      "end_time": "2025-07-15 14:02:31",
      "cost_time_ms": 236,
      "success": true,
      "input_rowsets_data_size": 10706329,
      "input_rowsets_count": 5,
      "input_row_num": 52000,
      "input_segments_num": 5,
      "input_rowsets_index_size": 204800,
      "input_rowsets_total_size": 10911129,
      "merged_rows": 1200,
      "filtered_rows": 50,
      "output_rows": 50750,
      "output_rowset_data_size": 5033164,
      "output_row_num": 50750,
      "output_segments_num": 1,
      "output_rowset_index_size": 102400,
      "output_rowset_total_size": 5135564,
      "merge_rowsets_latency_ms": 180,
      "bytes_read_from_local": 10911129,
      "bytes_read_from_remote": 0,
      "output_version": "[12-16]"
    }
  ]
}
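
A minimal client-side sketch of how the response above might be consumed. The function names `build_profile_url` and `summarize` are hypothetical helpers, not part of this PR, and the host and port are placeholders for the BE web server address. The row-count identity checked here (input rows minus merged and filtered rows equals output rows) holds for the example record; treating it as a general invariant is an assumption.

```python
from urllib.parse import urlencode


def build_profile_url(host, port, tablet_id=None, top_n=None):
    """Build the query URL for GET /api/compaction/profile.

    host/port are placeholders for the BE web server address (assumption).
    """
    params = {}
    if tablet_id is not None:
        params["tablet_id"] = tablet_id
    if top_n is not None:
        params["top_n"] = top_n
    base = f"http://{host}:{port}/api/compaction/profile"
    query = urlencode(params)
    return f"{base}?{query}" if query else base


def summarize(profile):
    """Derive a few diagnostic ratios from one compaction_profiles entry."""
    in_total = profile["input_rowsets_total_size"]
    out_total = profile["output_rowset_total_size"]
    cost_ms = profile["cost_time_ms"]
    expected_rows = (
        profile["input_row_num"]
        - profile["merged_rows"]
        - profile["filtered_rows"]
    )
    return {
        # Output bytes per input byte; lower means more space reclaimed.
        "size_ratio": out_total / in_total if in_total else None,
        # Fraction of wall-clock time spent in the rowset merge itself.
        "merge_share": (
            profile["merge_rowsets_latency_ms"] / cost_ms if cost_ms else None
        ),
        # Assumed identity: input - merged - filtered == output.
        "rows_accounted": expected_rows == profile["output_row_num"],
    }


if __name__ == "__main__":
    sample = {
        "cost_time_ms": 236,
        "input_row_num": 52000,
        "input_rowsets_total_size": 10911129,
        "merged_rows": 1200,
        "filtered_rows": 50,
        "output_row_num": 50750,
        "output_rowset_total_size": 5135564,
        "merge_rowsets_latency_ms": 180,
    }
    print(build_profile_url("127.0.0.1", 8040, tablet_id=12345, top_n=1))
    print(summarize(sample))
```

In practice the URL would be fetched with any HTTP client and each element of `compaction_profiles` passed to `summarize`; the network call is left out so the sketch stays self-contained.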

Features:

  • Filter by tablet_id to inspect compaction history of a specific tablet
  • Use top_n to limit results to the N most recent records
  • Failed compactions include status_msg for diagnosis, and output metrics are preserved even on failure (e.g., checksum mismatch) since they are the most valuable for debugging
  • Dynamically adjust compaction_profile_max_records at runtime to control memory usage (default: 500 records, ~150 KB), or set it to 0 to disable profile recording
  • All compaction paths are tracked: local (base/cumulative/full/cold_data), cloud (base/cumulative/full/index_change), and single-replica
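
The retention and query behavior listed above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the BE implementation: the class `ProfileHistory` and its method names are hypothetical, results are assumed to be returned newest first, and lowering the limit (including to 0) is assumed to drop the oldest records immediately.

```python
from collections import deque
from threading import Lock


class ProfileHistory:
    """Bounded, thread-safe history of compaction profile records (sketch)."""

    def __init__(self, max_records=500):
        self._lock = Lock()
        self._max = max(0, max_records)
        self._records = deque()

    def set_max_records(self, n):
        """Change the limit at runtime; 0 disables recording (assumption:
        existing records beyond the new limit are evicted right away)."""
        with self._lock:
            self._max = max(0, n)
            self._trim()

    def add(self, record):
        """Append a finished compaction's record, evicting the oldest."""
        with self._lock:
            if self._max == 0:
                return  # recording disabled
            self._records.append(record)
            self._trim()

    def _trim(self):
        # Caller must hold the lock.
        while len(self._records) > self._max:
            self._records.popleft()

    def query(self, tablet_id=None, top_n=None):
        """Return records newest first, optionally filtered and truncated."""
        with self._lock:
            matched = [
                r
                for r in reversed(self._records)
                if tablet_id is None or r["tablet_id"] == tablet_id
            ]
        return matched if top_n is None else matched[:top_n]


if __name__ == "__main__":
    history = ProfileHistory(max_records=3)
    for i in range(5):
        history.add({"compaction_id": i, "tablet_id": 12345 if i % 2 else 67890})
    print(history.query(tablet_id=12345, top_n=1))
```

A deque keeps eviction of the oldest record cheap, and filtering before truncating means `top_n` counts only records that match the requested tablet.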

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Yukang-Lian
Collaborator Author

run buildall

@doris-robot

TPC-H: Total hot run time: 26805 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 51666cf11dfcce64761217ac5261ccd0b169c421, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17683	4422	4304	4304
q2	q3	10644	823	523	523
q4	4669	362	256	256
q5	7545	1284	1020	1020
q6	178	176	152	152
q7	797	855	678	678
q8	9302	1475	1363	1363
q9	5000	4777	4674	4674
q10	6232	1936	1666	1666
q11	477	249	256	249
q12	698	583	466	466
q13	18028	2946	2179	2179
q14	229	238	214	214
q15	q16	729	733	678	678
q17	723	851	431	431
q18	6141	5480	5304	5304
q19	1106	988	593	593
q20	537	518	371	371
q21	4386	1842	1390	1390
q22	492	422	294	294
Total cold run time: 95596 ms
Total hot run time: 26805 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4809	4565	4641	4565
q2	q3	3943	4360	3829	3829
q4	876	1213	804	804
q5	4094	4459	4350	4350
q6	190	179	143	143
q7	1833	1690	1528	1528
q8	2473	2803	2582	2582
q9	7640	7565	7503	7503
q10	3736	4009	3594	3594
q11	514	445	452	445
q12	511	600	457	457
q13	2722	3170	2396	2396
q14	291	301	312	301
q15	q16	746	766	745	745
q17	1184	1345	1324	1324
q18	7299	6894	6725	6725
q19	933	937	954	937
q20	2075	2146	2028	2028
q21	4039	3685	3332	3332
q22	456	425	388	388
Total cold run time: 50364 ms
Total hot run time: 47976 ms

@doris-robot

TPC-DS: Total hot run time: 167771 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 51666cf11dfcce64761217ac5261ccd0b169c421, data reload: false

query5	4330	629	522	522
query6	316	218	207	207
query7	4205	468	270	270
query8	350	279	235	235
query9	8751	2719	2686	2686
query10	519	365	338	338
query11	6985	5081	4863	4863
query12	174	131	128	128
query13	1249	441	331	331
query14	5748	3655	3417	3417
query14_1	2777	2768	2777	2768
query15	207	192	174	174
query16	965	493	452	452
query17	916	722	610	610
query18	2449	452	346	346
query19	213	208	186	186
query20	133	125	130	125
query21	219	131	112	112
query22	13233	14065	14925	14065
query23	16179	16001	15590	15590
query23_1	15727	15527	15337	15337
query24	7252	1644	1238	1238
query24_1	1230	1206	1249	1206
query25	570	502	439	439
query26	1242	275	151	151
query27	2769	490	302	302
query28	4496	1851	1850	1850
query29	891	592	499	499
query30	310	228	192	192
query31	1013	962	888	888
query32	84	74	73	73
query33	551	348	301	301
query34	902	880	511	511
query35	654	676	616	616
query36	1122	1115	970	970
query37	149	112	88	88
query38	2996	2950	2861	2861
query39	870	843	812	812
query39_1	776	779	800	779
query40	231	149	134	134
query41	62	60	58	58
query42	254	255	253	253
query43	238	243	217	217
query44	
query45	192	187	186	186
query46	864	989	613	613
query47	2104	2151	2047	2047
query48	301	327	233	233
query49	630	467	377	377
query50	675	287	218	218
query51	4125	4005	4037	4005
query52	262	268	255	255
query53	282	334	288	288
query54	301	272	262	262
query55	93	89	80	80
query56	311	323	310	310
query57	1953	1696	1646	1646
query58	311	269	283	269
query59	2782	2945	2765	2765
query60	335	340	319	319
query61	188	151	150	150
query62	626	591	516	516
query63	313	278	268	268
query64	5165	1272	1008	1008
query65	
query66	1469	452	355	355
query67	24357	24293	24327	24293
query68	
query69	405	306	299	299
query70	949	986	926	926
query71	347	308	303	303
query72	2709	2643	2132	2132
query73	535	556	321	321
query74	9647	9577	9405	9405
query75	2867	2778	2470	2470
query76	2276	1045	701	701
query77	364	389	325	325
query78	11061	11169	10480	10480
query79	1134	777	611	611
query80	1321	638	534	534
query81	544	269	225	225
query82	1127	152	122	122
query83	345	261	242	242
query84	300	131	98	98
query85	899	467	446	446
query86	429	339	297	297
query87	3218	3111	3042	3042
query88	3566	2663	2690	2663
query89	414	377	352	352
query90	2055	179	177	177
query91	164	152	138	138
query92	77	74	69	69
query93	903	870	487	487
query94	639	311	289	289
query95	609	356	317	317
query96	650	516	233	233
query97	2525	2498	2429	2429
query98	240	223	217	217
query99	1031	990	911	911
Total cold run time: 249038 ms
Total hot run time: 167771 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.64% (19717/37459)
Line Coverage 36.20% (184191/508754)
Region Coverage 32.39% (142136/438811)
Branch Coverage 33.56% (62151/185167)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.31% (26879/36664)
Line Coverage 56.72% (287603/507040)
Region Coverage 54.05% (239374/442859)
Branch Coverage 55.80% (103589/185649)
